::page{title="Business AI Meeting Companion STT"}
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-WD0231EN-SkillsNetwork/IDSN-logo.png" width="200" alt="cognitiveclass.ai logo" />
## Introduction
Consider you're attending a business meeting where all conversations are captured by an advanced AI application. This application not only transcribes the discussions with high accuracy but also provides a concise summary of the meeting, emphasizing the key points and decisions made.
In this project, we'll use OpenAI's Whisper to transform speech into text. Next, we'll use IBM Watson's AI to summarize the text and find key points. We'll build an app with Hugging Face Gradio as the user interface.
### Learning Objectives
After finishing this lab, you will be able to:
- Create a Python script to generate text using a model from the Hugging Face Hub, identify some key parameters that influence the model's output, and gain a basic understanding of how to switch between different LLMs.
- Use OpenAI's Whisper to accurately convert meeting recordings into text.
- Implement IBM Watson's AI to effectively summarize the transcribed meetings and extract key points.
- Create an intuitive, user-friendly interface using Hugging Face Gradio.
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0V2VEN/images/DALL%C2%B7E%202024-02-29%2014.12.41%20-%20In%20a%20minimalist%20meeting%20room%20with%20a%20large%2C%20plain%20round%20table%2C%20a%20small%20and%20simple%20digital%20display%20is%20mounted%20on%20a%20white%20wall.%20The%20display%20shows%20%27Key%20Po.webp" width="500px" alt="langchain">
</center>
<center>
Generated by DALL·E 3
</center>
## Preparing the environment
Let's start by setting up the environment: create a Python virtual environment and install the required libraries using the following commands in the terminal:
```bash
pip3 install virtualenv
virtualenv my_env # create a virtual environment my_env
source my_env/bin/activate # activate my_env
```
Then, install the required libraries in the environment (this will take time ☕️☕️):
```bash
# installing required libraries in my_env
pip install transformers==4.35.2 torch==2.1.1 gradio==4.44.0 langchain==0.0.343 ibm_watson_machine_learning==1.0.335 huggingface-hub==0.19.4
```
Have a cup of coffee; it will take a few minutes.
```
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
```
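Once the install finishes, you can sanity-check that the pinned packages actually landed in `my_env`. This is an illustrative stdlib-only sketch (the `installed_version` helper is hypothetical, not part of the lab code):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Report what is (or is not) installed in the active environment
for pkg in ["transformers", "torch", "gradio", "langchain"]:
    print(pkg, "->", installed_version(pkg) or "not installed")
```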
We also need to install `ffmpeg` to be able to work with audio files in Python. First, update the package index:
```bash
sudo apt update
```
Then run:
```bash
sudo apt install ffmpeg -y
```
> Whisper from OpenAI is available on [GitHub](https://github.com/openai/whisper). Whisper's code and model weights are released under the MIT License. See [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE) for further details.
::page{title="Step 1: Speech-to-Text"}
Initially, we want to create a simple speech-to-text Python file using OpenAI Whisper.
You can test with the sample audio file: **[Sample voice (download link)](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/Testing%20speech%20to%20text.mp3)**
Create and open a Python file called `simple_speech2text.py` by clicking the link below:
::openFile{path="simple_speech2text.py"}
Let's download the file first (you can also do it manually, then drag and drop it into the file environment).
```python
import requests

# URL of the audio file to be downloaded
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/Testing%20speech%20to%20text.mp3"

# Send a GET request to the URL to download the file
response = requests.get(url)

# Define the local file path where the audio file will be saved
audio_file_path = "downloaded_audio.mp3"

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # If successful, write the content to the specified local file path
    with open(audio_file_path, "wb") as file:
        file.write(response.content)
    print("File downloaded successfully")
else:
    # If the request failed, print an error message
    print("Failed to download the file")
```
Run the Python file to test it.
```bash
python3 simple_speech2text.py
```
You should see the downloaded audio file in the file explorer.
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/images/whisper_download.jpg.jpg" width="300px" alt="langchain">
</center>
Next, implement OpenAI Whisper to transcribe speech to text.
You can overwrite the previous code in the Python file.
```python
import torch
from transformers import pipeline

# Initialize the speech-to-text pipeline from Hugging Face Transformers
# This uses the "openai/whisper-tiny.en" model for automatic speech recognition (ASR)
# The `chunk_length_s` parameter specifies the chunk length in seconds for processing
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny.en",
    chunk_length_s=30,
)

# Define the path to the audio file that needs to be transcribed
sample = 'downloaded_audio.mp3'

# Perform speech recognition on the audio file
# The `batch_size=8` parameter indicates how many chunks are processed at a time
# The result is stored in `prediction` with the key "text" containing the transcribed text
prediction = pipe(sample, batch_size=8)["text"]

# Print the transcribed text to the console
print(prediction)
```
Run the Python file to see the transcribed output.
```bash
python3 simple_speech2text.py
```
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/images/whisper%20simple%20result.jpg" width="600px" alt="langchain">
</center>
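To get a feel for what `chunk_length_s=30` and `batch_size=8` mean: the pipeline splits long audio into roughly 30-second windows and transcribes several windows per batch. As a back-of-the-envelope sketch (the real pipeline uses overlapping strides, so actual chunk counts differ slightly; `approx_chunks` is a hypothetical helper):

```python
import math

def approx_chunks(duration_s, chunk_length_s=30):
    """Rough number of fixed-length windows for an audio file (ignores stride overlap)."""
    return math.ceil(duration_s / chunk_length_s)

# A 95-second recording yields about 4 windows of up to 30 s each;
# with batch_size=8 they would all fit in a single batch.
print(approx_chunks(95))  # → 4
```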
In the next step, we will use `Gradio` to create an interface for our app.
::page{title="Gradio interface"}
## Creating a simple demo
Throughout this project, we will create different LLM applications with a Gradio interface. Let's get familiar with Gradio by creating a simple app:
Still in the `project` directory, create a Python file and name it `hello.py`.
Open `hello.py`, paste the following Python code and save the file.
```python
import gradio as gr

def greet(name):
    return "Hello " + name + "!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7860)
```
The above code creates a **gradio.Interface** called `demo`. It wraps the `greet` function with a simple text-to-text user interface that you could interact with.
The **gradio.Interface** class is initialized with 3 required parameters:
- fn: the function to wrap a UI around
- inputs: which component(s) to use for the input (e.g. "text", "image" or "audio")
- outputs: which component(s) to use for the output (e.g. "text", "image" or "label")
The last line `demo.launch()` launches a server to serve our `demo`.
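Under the hood, `fn` is an ordinary Python function: Gradio maps each input component to one positional argument and routes the return value to the output component. A small sketch with a hypothetical `greet_n` function taking two inputs (the component list in the comment is one plausible way to wire it up):

```python
def greet_n(name, intensity):
    # One parameter per input component; the return value feeds the output component
    return "Hello " + name + "!" * int(intensity)

# With Gradio this could be wrapped as:
#   gr.Interface(fn=greet_n, inputs=["text", "slider"], outputs="text")
print(greet_n("Bob", 3))  # → Hello Bob!!!
```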
## Launching the demo app
Now go back to the terminal and make sure that the `my_env` virtual environment name is displayed at the beginning of the prompt.
Next, run the following command to execute the Python script.
```bash
python3 hello.py
```
As the Python code is served by a local host, click on the button below and you will be able to see the simple application we just created. Feel free to play around with the input and output of the web app!
Click here to see the application:
::startApplication{port="7860" display="internal" name="Web application" route="/"}
You should see the following; here we entered the name Bob:

If you finish playing with the app and want to exit, **press `Ctrl+c` in the terminal and close the application tab**.
If you wish to learn a little bit more about customization in Gradio, you are invited to take the guided project called **Bring your Machine Learning model to life with Gradio**. You can find it under **Courses & Projects** on [cognitiveclass.ai](https://cognitiveclass.ai/)!
For the rest of this project, we will use Gradio as an interface for LLM apps.
::page{title="Step 2: Creating audio transcription app"}
Create a new Python file called `speech2text_app.py`:
::openFile{path="speech2text_app.py"}
#### Exercise: Complete the `transcript_audio` function.
Using the code from Step 1, fill in the missing parts of the `transcript_audio` function.
```python
import torch
from transformers import pipeline
import gradio as gr

# Function to transcribe audio using the OpenAI Whisper model
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = #-----> Fill here <----
    # Transcribe the audio file and return the result
    result = #-----> Fill here <----
    return result

# Set up Gradio interface
audio_input = gr.Audio(sources="upload", type="filepath")  # Audio input
output_text = gr.Textbox()  # Text output

# Create the Gradio interface with the function, inputs, and outputs
iface = gr.Interface(fn=transcript_audio,
                     inputs=audio_input, outputs=output_text,
                     title="Audio Transcription App",
                     description="Upload the audio file")

# Launch the Gradio app
iface.launch(server_name="0.0.0.0", server_port=7860)
```
<details>
<summary>Click here for the answer</summary>
import torch
from transformers import pipeline
import gradio as gr

# Function to transcribe audio using the OpenAI Whisper model
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny.en",
        chunk_length_s=30,
    )
    # Transcribe the audio file and return the result
    result = pipe(audio_file, batch_size=8)["text"]
    return result

# Set up Gradio interface
audio_input = gr.Audio(sources="upload", type="filepath")  # Audio input
output_text = gr.Textbox()  # Text output

# Create the Gradio interface with the function, inputs, and outputs
iface = gr.Interface(fn=transcript_audio,
                     inputs=audio_input, outputs=output_text,
                     title="Audio Transcription App",
                     description="Upload the audio file")

# Launch the Gradio app
iface.launch(server_name="0.0.0.0", server_port=7860)
</details>
Then, run your app:
```bash
python3 speech2text_app.py
```
And start the app:
::startApplication{port="7860" display="internal" name="Web application" route="/"}
You can download the sample audio file we've provided by right-clicking on it in the file explorer and selecting "Download." Once downloaded, you can upload this file to the app. Alternatively, feel free to choose and upload any MP3 audio file from your local computer.
The result will be:
<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/images/speech2text_app.jpg" width="600px" alt="langchain">
</center>
__Press Ctrl + C to stop the application.__
::page{title="Step 3: Integrating LLM: Using Llama 3 in WatsonX as LLM"}
## Running simple LLM
Let's start by generating text with LLMs. Create a Python file and name it `simple_llm.py`. You can proceed by clicking the link below or by referencing the accompanying image.
::openFile{path="simple_llm.py"}
If you want to use Llama 3 as the LLM instance, you can follow the instructions below:
> IBM watsonx provides access to various language models, including Llama 3 by Meta, currently one of the strongest open-source language models.
Here's how the code works:
1. **Setting up credentials**: The credentials needed to access IBM's services are pre-arranged by the Skills Network team, so you don't have to worry about setting them up yourself.
2. **Specifying parameters**: The code then defines specific parameters for the language model. 'MAX_NEW_TOKENS' caps the number of tokens the model can generate in one run. 'TEMPERATURE' adjusts how creative or predictable the generated text is.
3. **Setting up the Llama 3 model**: Next, the Llama 3 model is set up using a model ID, the provided credentials, the chosen parameters, and a project ID.
4. **Creating an object for Llama 3**: A model object, `LLAMA3_model`, is created with the `Model` class, initialized with the model ID, credentials, parameters, and project ID. An instance of `WatsonxLLM` is then created from `LLAMA3_model`, giving the `llm` object used to interact with the model.
5. **Generating and printing a response**: Finally, `llm` is used to generate a response to the question, "How to read a book effectively?" The response is then printed out.
```python
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

my_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com"
}

params = {
    GenParams.MAX_NEW_TOKENS: 700,  # The maximum number of tokens that the model can generate in a single run.
    GenParams.TEMPERATURE: 0.1,     # Controls the randomness of token generation: lower is more deterministic, higher is more random.
}

LLAMA3_model = Model(
    model_id='meta-llama/llama-3-2-11b-vision-instruct',
    credentials=my_credentials,
    params=params,
    project_id="skills-network",
)

llm = WatsonxLLM(LLAMA3_model)

print(llm("How to read a book effectively?"))
```
You can then run this script in the terminal using the following command:
```bash
python3 simple_llm.py
```
Upon running the script, you should see the generated text in your terminal, as shown below:
<div style="text-align:center">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/images/llama2_read_book.jpg" width="900" alt="">
</div>
You can see how watsonx Llama 3 provides a good answer.
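To see why a low 'TEMPERATURE' makes generation more deterministic, here is a small illustrative sketch of temperature-scaled softmax over toy logits (this is the standard formula, not watsonx internals):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.1))  # near one-hot: sampling is almost deterministic
print(softmax_with_temperature(logits, 2.0))  # much flatter: sampling is more random
```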
::page{title="Step 4: Put them all together"}
Create a new Python file and call it `speech_analyzer.py`:
::openFile{path="speech_analyzer.py"}
In this exercise, we'll set up a language model (LLM) instance, which could be IBM WatsonxLLM, HuggingFaceHub, or an OpenAI model. Then, we'll establish a prompt template. These templates are structured guides for generating prompts for language models, aiding in output organization (more info in the [LangChain prompt template docs](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/)).
Next, we'll develop a transcription function that employs the OpenAI Whisper model to convert speech to text. This function takes an audio file uploaded through the Gradio app interface (preferably in .mp3 format). The transcribed text is then fed into an LLMChain, which merges the text with the prompt template and forwards the result to the chosen LLM. The final output from the LLM is then displayed in the Gradio app's output textbox.
The output should look like this:
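Concretely, a prompt template is a string with named slots that get filled at run time; for the `{context}` variable, `PromptTemplate` does essentially what Python's `str.format` does. A minimal sketch with a hypothetical transcript string:

```python
temp = """
<s><<SYS>>
List the key points with details from the context:
[INST] The context : {context} [/INST]
<</SYS>>
"""

# Filling the {context} slot with a transcript is a plain string substitution
filled = temp.format(context="We agreed to ship version 2 on Friday.")
print(filled)
```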
<div style="text-align:center">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/images/final_result_whisper.jpg" width="700" alt="">
</div>
Notice how the LLM corrected a minor mistake made by the speech-to-text model, resulting in a coherent and accurate output.
## Exercise: Fill the missing parts:
```python
import torch
import os
import gradio as gr
#from langchain.llms import OpenAI
from langchain.llms import HuggingFaceHub
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models import Model

#######------------- LLM-------------####
# Initiate the LLM instance; this can be an IBM watsonx, Hugging Face, or OpenAI instance
llm = ###---> write your code here

#######------------- Prompt Template-------------####
# This template uses Llama-style tags. If you are using other LLMs, feel free to remove the tags
temp = """
<s><<SYS>>
List the key points with details from the context:
[INST] The context : {context} [/INST]
<</SYS>>
"""

# Here is the simplified version of the prompt template:
# temp = """
# List the key points with details from the context:
# The context : {context}
# """

pt = PromptTemplate(
    input_variables=["context"],
    template=temp)

prompt_to_LLAMA2 = LLMChain(llm=llm, prompt=pt)

#######------------- Speech2text-------------####
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = #------> write the code here
    # Transcribe the audio file and return the result
    transcript_txt = pipe(audio_file, batch_size=8)["text"]
    # Run the chain to merge the transcript text with the template and send it to the LLM
    result = prompt_to_LLAMA2.run(transcript_txt)
    return result

#######------------- Gradio-------------####
audio_input = gr.Audio(sources="upload", type="filepath")
output_text = gr.Textbox()

# Create the Gradio interface with the function, inputs, and outputs
iface = #---> write code here

iface.launch(server_name="0.0.0.0", server_port=7860)
```
<details>
<summary>Click here for the answer</summary>
import torch
import os
import gradio as gr
#from langchain.llms import OpenAI
from langchain.llms import HuggingFaceHub
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

my_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com"
}

params = {
    GenParams.MAX_NEW_TOKENS: 800,  # The maximum number of tokens that the model can generate in a single run.
    GenParams.TEMPERATURE: 0.1,     # Controls the randomness of token generation: lower is more deterministic, higher is more random.
}

LLAMA3_model = Model(
    model_id='meta-llama/llama-3-2-11b-vision-instruct',
    credentials=my_credentials,
    params=params,
    project_id="skills-network",
)

llm = WatsonxLLM(LLAMA3_model)

#######------------- Prompt Template-------------####
temp = """
<s><<SYS>>
List the key points with details from the context:
[INST] The context : {context} [/INST]
<</SYS>>
"""

pt = PromptTemplate(
    input_variables=["context"],
    template=temp)

prompt_to_LLAMA2 = LLMChain(llm=llm, prompt=pt)

#######------------- Speech2text-------------####
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny.en",
        chunk_length_s=30,
    )
    # Transcribe the audio file and return the result
    transcript_txt = pipe(audio_file, batch_size=8)["text"]
    result = prompt_to_LLAMA2.run(transcript_txt)
    return result

#######------------- Gradio-------------####
audio_input = gr.Audio(sources="upload", type="filepath")
output_text = gr.Textbox()

iface = gr.Interface(fn=transcript_audio,
                     inputs=audio_input, outputs=output_text,
                     title="Audio Transcription App",
                     description="Upload the audio file")

iface.launch(server_name="0.0.0.0", server_port=7860)
</details>
Run your code:
```bash
python3 speech_analyzer.py
```
If there is no error, run the web app:
::startApplication{port="7860" display="internal" name="Web application" route="/"}
::page{title="Conclusion"}
Congratulations on completing this project! You have now laid a solid foundation for leveraging powerful large language models (LLMs) in speech-to-text applications. Here's a quick recap of what you've accomplished:
- Text generation with an LLM: You created a Python script to generate text using a model from the Hugging Face Hub, learned about some key parameters that influence the model's output, and gained a basic understanding of how to switch between different LLMs.
- Speech-to-text conversion: You used OpenAI's Whisper to accurately convert meeting recordings into text.
- Content summarization: You implemented IBM Watson's AI to effectively summarize the transcribed meetings and extract key points.
- User interface development: You created an intuitive, user-friendly interface using Hugging Face Gradio.
## Author(s)
#### Sina Nazeri
<h3 align="center">© IBM Corporation. All rights reserved.</h3>
<!--
## Change log
| Date | Version | Changed by | Change Description |
|------|--------|--------|---------|
| 2024-02-15 | 0.1 | Sina Nazeri| Created project first draft |
| 2024-03-28 | 0.2 | Javed Ansari | Fixed QA comments |
| 2024-04-16| 0.3 | Anita Narain | ID Reviewed |
| 2025-02-16| 0.4 | Vandana Pandey | Changed the model|
-->
Consider you're attending a business meeting where all conversations are being captured by an advanced AI application. This application not only transcribes the discussions with high accuracy but also provides a concise summary of the meeting, emphasizing the key points and decisions made.
In our project, we'll use OpenAI's Whisper to transform speech into text. Next, we'll use IBM Watson's AI to summarize and find key points. We'll make an app with Hugging Face Gradio as the user interface.
After finishing this lab, you will able to:
Let's start with setting up the environment by creating a Python virtual environment and installing the required libraries, using the following commands in the terminal:
- 1
- 2
- 3
pip3 install virtualenvvirtualenv my_env # create a virtual environment my_envsource my_env/bin/activate # activate my_env
Then, install the required libraries in the environment (this will take time ☕️☕️):
- 1
- 2
# installing required libraries in my_envpip install transformers==4.35.2 torch==2.1.1 gradio==4.44.0 langchain==0.0.343 ibm_watson_machine_learning==1.0.335 huggingface-hub==0.19.4
Have a cup of coffee, it will take a few minutes.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
) (( ) )) ( (_______)_.-'---------|( C|/\/\/\/\/|'-./\/\/\/\/|'_________''-------'
We need to install ffmpeg to be able to work with audio files in python.
- 1
sudo apt update
Then run:
- 1
sudo apt install ffmpeg -y
Whisper from OpenAI is available in github. Whisper's code and model weights are released under the MIT License. See LICENSE for further details.
Initially, we want to create a simple speech-to-text Python file using OpenAI Whisper.
You can test the sample audio file Sample voice link to download.
Create and open a Python file and call it simple_speech2text.py by clicking the link below:
Let's download the file first (you can do it manually, then drag and drop it into the file environment).
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
import requests# URL of the audio file to be downloadedurl = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/Testing%20speech%20to%20text.mp3"# Send a GET request to the URL to download the fileresponse = requests.get(url)# Define the local file path where the audio file will be savedaudio_file_path = "downloaded_audio.mp3"# Check if the request was successful (status code 200)if response.status_code == 200:# If successful, write the content to the specified local file pathwith open(audio_file_path, "wb") as file:file.write(response.content)print("File downloaded successfully")else:# If the request failed, print an error messageprint("Failed to download the file")
Run the Python file to test it.
- 1
python3 simple_speech2text.py
You should see the downloaded audio file in the file explorer.
Next, implement OpenAI Whisper for transcribing voice to speech.
You can override the previous code in the Python file.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
import torchfrom transformers import pipeline# Initialize the speech-to-text pipeline from Hugging Face Transformers# This uses the "openai/whisper-tiny.en" model for automatic speech recognition (ASR)# The `chunk_length_s` parameter specifies the chunk length in seconds for processingpipe = pipeline("automatic-speech-recognition",model="openai/whisper-tiny.en",chunk_length_s=30,)# Define the path to the audio file that needs to be transcribedsample = 'downloaded_audio.mp3'# Perform speech recognition on the audio file# The `batch_size=8` parameter indicates how many chunks are processed at a time# The result is stored in `prediction` with the key "text" containing the transcribed textprediction = pipe(sample, batch_size=8)["text"]# Print the transcribed text to the consoleprint(prediction)
Run the Python file and you will get the output.
- 1
python3 simple_speech2text.py
In the next step, we will utilize Gradio for creating interface for our app.
Through this project, we will create different LLM applications with Gradio interface. Let's get familiar with Gradio by creating a simple app:
Still in the project directory, create a Python file and name it hello.py.
Open hello.py, paste the following Python code and save the file.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
import gradio as grdef greet(name):return "Hello " + name + "!"demo = gr.Interface(fn=greet, inputs="text", outputs="text")demo.launch(server_name="0.0.0.0", server_port= 7860)
The above code creates a gradio.Interface called demo. It wraps the greet function with a simple text-to-text user interface that you could interact with.
The gradio.Interface class is initialized with 3 required parameters:
The last line demo.launch() launches a server to serve our demo.
Now go back to the terminal and make sure that the my_env virtual environment name is displayed at the beginning of the line
Next, run the following command to execute the Python script.
- 1
python3 hello.py
As the Python code is served by a local host, click on the button below and you will be able to see the simple application we just created. Feel free to play around with the input and output of the web app!
Click here to see the application:
You should see the following, here we entered the name Bob:
If you finish playing with the app and want to exit, press Ctrl+c in the terminal and close the application tab.
If you wish to learn a little bit more about customization in Gradio, you are invited to take the guided project called Bring your Machine Learning model to life with Gradio. You can find it under Courses & Projects on cognitiveclass.ai!
For the rest of this project, we will use Gradio as an interface for LLM apps.
Create a new python file speech2text_app.py
transcript_audio function.
From the step1: fill the missing parts in transcript_audio function.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
import torchfrom transformers import pipelineimport gradio as gr# Function to transcribe audio using the OpenAI Whisper modeldef transcript_audio(audio_file):# Initialize the speech recognition pipelinepipe = #-----> Fill here <----# Transcribe the audio file and return the resultresult = #-----> Fill here <----return result# Set up Gradio interfaceaudio_input = gr.Audio(sources="upload", type="filepath") # Audio inputoutput_text = gr.Textbox() # Text output# Create the Gradio interface with the function, inputs, and outputsiface = gr.Interface(fn=transcript_audio,inputs=audio_input, outputs=output_text,title="Audio Transcription App",description="Upload the audio file")# Launch the Gradio appiface.launch(server_name="0.0.0.0", server_port=7860)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
import torchfrom transformers import pipelineimport gradio as gr# Function to transcribe audio using the OpenAI Whisper modeldef transcript_audio(audio_file):# Initialize the speech recognition pipelinepipe = pipeline("automatic-speech-recognition",model="openai/whisper-tiny.en",chunk_length_s=30,)# Transcribe the audio file and return the resultresult = pipe(audio_file, batch_size=8)["text"]return result# Set up Gradio interfaceaudio_input = gr.Audio(sources="upload", type="filepath") # Audio inputoutput_text = gr.Textbox() # Text output# Create the Gradio interface with the function, inputs, and outputsiface = gr.Interface(fn=transcript_audio,inputs=audio_input, outputs=output_text,title="Audio Transcription App",description="Upload the audio file")# Launch the Gradio appiface.launch(server_name="0.0.0.0", server_port=7860)
Then, run your app:
- 1
python3 speech2text_app.py
And start the app:
You can download the sample audio file we've provided by right-clicking on it in the file explorer and selecting "Download." Once downloaded, you can upload this file to the app. Alternatively, feel free to choose and upload any MP3 audio file from your local computer.
The result will be:
Press Ctrl + C to stop the application.
Let's start by generating text with LLMs. Create a Python file and name it simple_llm.py. You can proceed by clicking the link below or by referencing the accompanying image.
In case, you want to use Llama 3 as an LLM instance, you can follow the instructions below:
IBM WatsonX utilizes various language models, including Llama 3 by Meta, which is currently the strongest open-source language model.
Here's how the code works:
Setting up credentials: The credentials needed to access IBM's services are pre-arranged by the Skills Network team, so you don't have to worry about setting them up yourself.
Specifying parameters: The code then defines specific parameters for the language model. 'MAX_NEW_TOKENS' sets the limit on the number of words the model can generate in one go. 'TEMPERATURE' adjusts how creative or predictable the generated text is.
Setting up Llama 3 model: Next, the LLAMA3 model is set up using a model ID, the provided credentials, chosen parameters, and a project ID.
Creating an object for Llama 3: The code creates an object named llm, which is used to interact with the Llama 3 model. A model object, LLAMA3_model, is created using the Model class, which is initialized with a specific model ID, credentials, parameters, and project ID. Then, an instance of WatsonxLLM is created with LLAMA3_model as an argument, initializing the language model hub llm object.
Generating and printing response: Finally, 'llm' is used to generate a response to the question, "How to read a book effectively?" The response is then printed out.
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
from ibm_watson_machine_learning.foundation_models import Modelfrom ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLMfrom ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParamsmy_credentials = {"url" : "https://us-south.ml.cloud.ibm.com"}params = {GenParams.MAX_NEW_TOKENS: 700, # The maximum number of tokens that the model can generate in a single run.GenParams.TEMPERATURE: 0.1, # A parameter that controls the randomness of the token generation. A lower value makes the generation more deterministic, while a higher value introduces more randomness.}LLAMA2_model = Model(model_id= 'meta-llama/llama-3-2-11b-vision-instruct',credentials=my_credentials,params=params,project_id="skills-network",)llm = WatsonxLLM(LLAMA2_model)print(llm("How to read a book effectively?"))
You can then run this script in the terminal using the following command:
- 1
python3 simple_llm.py
Upon running the script, you should see the generated text in your terminal, as shown below:
You can see how watsonx Llama 2 provides a good answer.
Create a new Python file and call it speech_analyzer.py
In this exercise, we'll set up a language model (LLM) instance, which could be IBM WatsonxLLM, HuggingFaceHub, or an OpenAI model. Then, we'll establish a prompt template. These templates are structured guides to generate prompts for language models, aiding in output organization (more info in langchain prompt template.
Next, we'll develop a transcription function that employs the OpenAI Whisper model to convert speech to text. This function takes an audio file uploaded through the Gradio app interface (preferably in .mp3 format). The transcribed text is then fed into an LLMChain, which merges the text with the prompt template and forwards it to the chosen LLM. The final output from the LLM is then displayed in the Gradio app's output textbox.
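The `chunk_length_s=30` argument you will pass to the Whisper pipeline tells it to split long recordings into 30-second windows before transcribing them. The windowing idea can be sketched in plain Python (this is a simplification; the real pipeline also overlaps adjacent chunks with a stride so words are not cut at the boundaries):

```python
def chunk_boundaries(total_samples, sample_rate, chunk_length_s=30):
    # Return (start, end) sample indices for fixed-length audio chunks,
    # mimicking how an ASR pipeline windows a long recording.
    chunk = chunk_length_s * sample_rate
    return [(start, min(start + chunk, total_samples))
            for start in range(0, total_samples, chunk)]

# A hypothetical 75-second recording at 16 kHz (the sample rate Whisper expects)
bounds = chunk_boundaries(75 * 16000, 16000)
print(bounds)  # two full 30 s windows plus one 15 s remainder
```

Each window is transcribed independently (batched together via `batch_size`), and the pieces are concatenated into the final transcript.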
The output should look like this:
Notice how the LLM corrected a minor mistake made by the speech-to-text model, resulting in a coherent and accurate output.
```python
import torch
import os
import gradio as gr
#from langchain.llms import OpenAI
from langchain.llms import HuggingFaceHub
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models import Model

#######------------- LLM-------------####
# Initiate the LLM instance; this can be an IBM watsonx, Hugging Face, or OpenAI instance
llm = ###---> write your code here

#######------------- Prompt Template-------------####
# This template is structured for LLAMA-style models. If you are using other LLMs, feel free to remove the tags
temp = """
<s><<SYS>>
List the key points with details from the context:
[INST] The context : {context} [/INST]
<</SYS>>
"""

# Here is a simplified version of the prompt template:
# temp = """
# List the key points with details from the context:
# The context : {context}
# """

pt = PromptTemplate(
    input_variables=["context"],
    template=temp
)

prompt_to_LLAMA2 = LLMChain(llm=llm, prompt=pt)

#######------------- Speech2text-------------####
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = #------> write the code here

    # Transcribe the audio file and return the result
    transcript_txt = pipe(audio_file, batch_size=8)["text"]

    # Run the chain to merge the transcript text with the template and send it to the LLM
    result = prompt_to_LLAMA2.run(transcript_txt)
    return result

#######------------- Gradio-------------####
audio_input = gr.Audio(sources="upload", type="filepath")
output_text = gr.Textbox()

# Create the Gradio interface with the function, inputs, and outputs
iface = #---> write code here

iface.launch(server_name="0.0.0.0", server_port=7860)
```
```python
import torch
import os
import gradio as gr
#from langchain.llms import OpenAI
from langchain.llms import HuggingFaceHub
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

#######------------- LLM-------------####
my_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com"
}

params = {
    GenParams.MAX_NEW_TOKENS: 800,  # The maximum number of tokens that the model can generate in a single run.
    GenParams.TEMPERATURE: 0.1,     # Controls randomness: a lower value makes generation more deterministic, a higher value introduces more randomness.
}

LLAMA3_model = Model(
    model_id='meta-llama/llama-3-2-11b-vision-instruct',
    credentials=my_credentials,
    params=params,
    project_id="skills-network",
)

llm = WatsonxLLM(LLAMA3_model)

#######------------- Prompt Template-------------####
temp = """
<s><<SYS>>
List the key points with details from the context:
[INST] The context : {context} [/INST]
<</SYS>>
"""

pt = PromptTemplate(
    input_variables=["context"],
    template=temp
)

prompt_to_LLAMA2 = LLMChain(llm=llm, prompt=pt)

#######------------- Speech2text-------------####
def transcript_audio(audio_file):
    # Initialize the speech recognition pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny.en",
        chunk_length_s=30,
    )

    # Transcribe the audio file and return the result
    transcript_txt = pipe(audio_file, batch_size=8)["text"]

    # Run the chain to merge the transcript text with the template and send it to the LLM
    result = prompt_to_LLAMA2.run(transcript_txt)
    return result

#######------------- Gradio-------------####
audio_input = gr.Audio(sources="upload", type="filepath")
output_text = gr.Textbox()

iface = gr.Interface(
    fn=transcript_audio,
    inputs=audio_input,
    outputs=output_text,
    title="Audio Transcription App",
    description="Upload the audio file",
)

iface.launch(server_name="0.0.0.0", server_port=7860)
```
Run your code:
```shell
python3 speech_analyzer.py
```
If there are no errors, the web app will be served on port 7860; open it in your browser to try it out.
Congratulations on completing this project! You have now laid a solid foundation for leveraging powerful Language Models (LLMs) for speech-to-text generation tasks. Here's a quick recap of what you've accomplished:
Text generation with LLM: You've created a Python script to generate text using a model from the Hugging Face Hub, learned about some key parameters that influence the model's output, and have a basic understanding of how to switch between different LLM models.
Speech-to-text conversion: You've used OpenAI's Whisper technology to accurately convert lecture recordings into text.
Content summarization: You've implemented IBM Watson's AI to effectively summarize the transcribed lectures and extract key points.
User interface development: You've created an intuitive and user-friendly interface using Hugging Face Gradio, ensuring ease of use for students and educators.